Abstract

Background: DIRS1-like elements compose one superfamily of tyrosine recombinase-encoding retrotransposons.
They have previously been reported in only a few diverse eukaryote species, suggesting a patchy distribution, and
little is known about their origin and dynamics. Recently, we have shown that these retrotransposons are common
among decapods, which calls into question the distribution of DIRS1-like retrotransposons among eukaryotes.

Results: To determine the distribution of DIRS1-like retrotransposons, we developed a new computational tool,
ReDoSt, which allows us to identify well-conserved DIRS1-like elements. By screening 274 completely sequenced
genomes, we identified more than 4000 DIRS1-like copies distributed among 30 diverse species, which can be
clustered into roughly 300 families. While the diversity in most species appears restricted to a low copy number, a
few bursts of transposition are strongly suggested in certain species, such as Danio rerio and Saccoglossus
kowalevskii.

Conclusion: In this study, we report 14 new species and 8 new higher taxa that were not previously known to
harbor DIRS1-like retrotransposons. Now reported in 61 species, these elements appear widely distributed among
eukaryotes, even if they remain undetected in streptophytes and mammals. Especially in unikonts, a broad range of
taxa from Cnidaria to Sauropsida harbors such elements. Both the distribution and the similarities between the
DIRS1-like element phylogeny and conventional phylogenies of the host species suggest that DIRS1-like
retrotransposons emerged early during the radiation of eukaryotes.







Background 

The tyrosine recombinase (YR)-encoding elements constitute
one of the major groups of retrotransposons [1,2]. These
elements encode a YR that is required for integration into
the genome [3], distinguishing them from other
retrotransposons (i.e., LTR retrotransposons, LINEs, SINEs
and Penelope) [4]. DIRS1-like retrotransposons belong to the
YR-encoding element superfamilies [5], whose constituents
exhibit a unique structure made up of three ORFs and
uncommon repeats (Figure 1). The first ORF encodes a
putative GAG protein, the second the YR, and the third a
pol region composed of three distinct domains: a reverse
transcriptase (RT), an RNase H (RH), and a methyltransferase
(MT). The function of the latter remains unknown. Depending
on the element considered, there may be considerable overlap
between the pol and the YR regions (Figure 1). The catalytic
tyrosine recombinase domain is encoded by the
non-overlapping 3'-end of the YR ORF. Many phylogenetic
analyses have shown that the RT/RH domains of DIRS1-like
retrotransposons are closely related to those of Ty3/Gypsy
LTR retrotransposons, suggesting that all these elements
diverged from an ancient GAG-pol form of retrotransposon
[5-7]. DIRS1-like elements are bounded by Inverted Terminal
Repeats (ITRs) and harbor two Internal Complementary
Regions (ICRs). The two ICRs located at the 3'-end of the
element appear to overlap on a 3-bp motif called the
circular junction. Just as the left ICR is
inverse-complementary to the beginning of the left ITR, the
right ICR is inverse-complementary to the end of the right
ITR, but the right ICR also appears complementary to an
extension of the right ITR that is called the right
Extension (rE) [1]. Given these unusual features, an
integration model has been proposed [3,5] in which the
ITRs' extremities match with their respective ICR. The
junction of
Figure 1 Structure of the DIRS1 reference element identified in the slime mold Dictyostelium discoideum. The three ORFs encoding the
GAG, tyrosine recombinase (YR) and pol (Reverse Transcriptase (RT) - RNase H (RH) - MethylTransferase (MT) domain series) regions correspond
to shaded boxes. The two Inverted Terminal Repeats (ITRs) are represented by the outer triangles. The two Internal Complementary Regions
(ICRs) correspond to the inner triangles. The rE at the 3'-end of the element is represented by the red box. The positions of the three alignment
profiles used to screen for DIRS1-like elements among genomes are symbolized by black bars under their respective domains (RT-, MT- and YR-
encoding domains).



the two ITRs results in the formation of a rolling-circle
intermediate of the element. Integration of the element
then occurs by recombination between the 3-bp ITR junction
sequence (complementary to the circular junction) and an
identical sequence in the genome, which does not produce
any target site duplications. This unique structure
distinguishes DIRS1-like retrotransposons from the other
YR-encoding elements, collectively known as the DIRS order
[2], which also includes the Ngaro, Viper and PAT elements.
The Ngaro and Viper retrotransposons are devoid of the MT
domain and do not usually harbor ORF overlaps [6,8].
Elements from the PAT superfamily, the sister group of
DIRS1-like retrotransposons, differ most prominently in
their repeats. The PAT retrotransposons (PAT-like elements,
TOC elements and kangaroo) are bounded by "Split" Direct
Repeats (SDRs) and can contain tyrosine
recombinase-encoding regions in an inverted orientation [5].
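
The inverse complementarity between an ICR and the extremity of its ITR, as described above, can be tested with a simple reverse-complement comparison. The following is a minimal sketch only: the sequence used is a toy example invented for illustration and is not taken from any real DIRS1-like element, and the function names are hypothetical.

```python
# Toy illustration only: the sequences below are hypothetical,
# not taken from a real DIRS1-like element.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_inverse_complementary(icr: str, itr_segment: str) -> bool:
    """True if the ICR is the exact inverse complement of the ITR segment."""
    return icr.upper() == reverse_complement(itr_segment).upper()


left_itr_start = "TGTAACGGA"                   # hypothetical start of a left ITR
left_icr = reverse_complement(left_itr_start)  # what a perfect left ICR would be
print(left_icr, is_inverse_complementary(left_icr, left_itr_start))
```

Real repeats are rarely perfect, so an actual search would presumably tolerate mismatches; the exact comparison here is kept only for clarity.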

Transposable elements have been found in all eukaryotic
species investigated thus far [2]. However, depending on
the superfamily or family of elements studied, they show
different distributions among eukaryotes. For example, the
Ty1/Copia, Ty3/Gypsy, LINE and SINE retrotransposons and
the Tc1/Mariner transposons have been detected almost
ubiquitously [2,7,9-11]. The Penelope retrotransposons are
also abundant in many animal species, but seem to be rare
among plants, protists and fungi [12]. In contrast, the
Maverick transposons (also called Polintons) show a highly
patchy distribution among diverse eukaryote species and
have not been found in plants [13,14]. Until recently,
bibliographic data and automatic annotations had revealed
the presence of DIRS1-like retrotransposons in only 43
diverse eukaryote organisms (Table 1), mostly with a low
diversity per species (up to four families in
Strongylocentrotus purpuratus and three families in Danio
rerio [1,5]), with the notable exception of Xenopus
tropicalis (73 families deposited in Repbase [15]). They
have not been described in several well-studied groups
(e.g., plants and mammals), and are absent from model
organisms such as Saccharomyces cerevisiae and Drosophila
melanogaster. The DIRS1-like retrotransposons nevertheless
appear widely distributed among decapod crustaceans [16].
These elements were previously detected using PCR
approaches in 16 decapod species, including some shrimps,
lobsters, crabs and galatheid crabs. The wide distribution
among decapods and the continuous identification of
elements in new species with the emergence of large-scale
genome sequencing call into question their supposedly
patchy distribution among eukaryote species.

We aim to determine the distribution of DIRS1-like
retrotransposons among eukaryotes using an in silico
approach. In the post-genome era, several automatic
annotation tools have been developed to detect the
presence of particular types of transposable elements in
genomes. The conventional approaches are based on
similarity searching using the RepeatMasker program [17].
However, transposable elements often correspond to ancient
genome components. Many copies, even within the same
family, appear fragmented and divergent in nucleotide
sequence due to point mutations, rearrangements, and
insertions or deletions (indels). Similarity
searching-based programs are efficient in identifying
copies closely related to those previously reported in the
library, but they often appear inefficient in detecting
very divergent copies or
Table 1 Survey of the eukaryote species in which DIRS1-like
retrotransposons were previously detected

Higher taxon      Species                          References
Actinistia        Latimeria menadoensis            Repbase
Actinopterygii    Danio rerio                      [1,8]
                  Oncorhynchus mykiss              GenBank (2006)
                  Salmo salar                      GenBank (2006)
                  Takifugu rubripes                [46]
                  Tetraodon nigroviridis           [1]
Amoebozoa         Dictyostelium discoideum         [47]
Amphibia          Xenopus laevis                   [1]
                  Xenopus tropicalis               [1]
Cnidaria          Nematostella vectensis           [48]
Crustacea         Daphnia pulex                    [24,25]
                  16 decapod species               [16]
Dinoflagellata    Perkinsus marinus                GenBank (2010)
Echinodermata     Arbacia punctulata               [21]
                  Lytechinus variegatus            [8]
                  Strongylocentrotus purpuratus    [1]
Hemichordata      Saccoglossus kowalevskii         GenBank (2010)
Hexapoda          Apis mellifera                   [21]
                  Camponotus floridanus            [27]
                  Glyptapanteles indiensis         GenBank (2008)
                  Harpegnathos saltator            [27]
                  Nasonia vitripennis              GenBank (2007)
                  Solenopsis invicta               [28]
                  Tribolium castaneum              [21]
Mucoromycotina    Phycomyces blakesleeanus         [49]
                  Rhizopus oryzae                  [8]
Sauropsida        Gopherus agassizii               [5]
Urochordata       Oikopleura dioica                [50]

All the detected DIRS1-like elements, even in partial sequences, are reported
here.

Notes: Repbase: version 14.06 (http://www.girinst.org/repbase/update/index.html).



unknown elements [18]. Other in silico approaches have been
developed to detect particular types of elements. These
programs, such as LTRharvest [19], are not based upon
similarity searching but on specific signature searches
(e.g., the nature of the termini and the presence of target
site duplications). While some programs have been developed
to detect LTR retrotransposons or transposons, none have
been developed for DIRS1-like retrotransposons. Such a
program might appear inefficient in identifying divergent
DIRS1-like retrotransposons because the training dataset
that is currently available for these elements remains too
limited (only 18 reference elements with detectable ITRs,
for example). Some de novo approaches that detect more
divergent transposable elements, such as RECON [20], have
been developed to exhaustively report the content of
repeated sequences within genomes. To identify a specific
type of element, this report must then be investigated
further, for example by similarity searching. For the same
reasons as those given for similarity searching-based
methods, such approaches could appear inappropriate for
studying the distribution of the DIRS1-like
retrotransposons.

We present here a new computational approach, which we call
ReDoSt, specifically dedicated to the identification of
DIRS1-like retrotransposons among genomes. Our method is
based on both the detection of the structure of these
elements and on sequence similarity searches performed
using alignment profiles designed on coding domains. It has
the advantages of not depending on the element copy number
and of avoiding any preconception of the ITRs (length or
sequence identity). With our method we analyzed 274
completely sequenced genomes, which allowed for a high
coverage of eukaryotic diversity, especially plants and
unikonts.

We have identified more than 4000 element copies that can
be clustered into approximately 300 new families. We report
the first DIRS1-like element copy number estimate among
many genomes and we evaluate the diversity within the
DIRS1-like superfamily. Their distribution appears wider
than was previously thought, especially in unikont species.
Sequence analyses confirmed the presence of well-conserved
DIRS1-like retrotransposons in 28 species, including at
least 14 species that were not previously known to host
such elements, and allowed us to define a more precise
structure of the DIRS1-like retrotransposons, especially in
their terminal repeats.

Results and Discussion 

Identification of putative DIRS1-like retrotransposons in
eukaryote genomes

To study the distribution of DIRS1-like retrotransposons
among genomes, we developed a new computational tool that
we call ReDoSt (Retrotransposon Domain and Structure). The
element detection is mainly based on independent similarity
searches for co-oriented and well-ordered RT-, MT- and
YR-encoding domains within a single 10-kb genomic fragment
(see Methods). The DIRS1-like copies detected with ReDoSt
may therefore be considered as well-conserved (i.e., with
the simultaneous recognizable presence of these three
characteristic domains), which suggests that they may still
be active, or have moved only recently. Thus, relics and
highly degenerate elements are not considered here.
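
The domain-based filter described above can be sketched as follows. This is not the ReDoSt implementation, which is described in the Methods; it is a minimal illustration that assumes the similarity searches have already produced a list of domain hits with coordinates and strand. The `Hit` record, the function name and the RT, MT, YR reading order are assumptions made for the example; only the 10-kb limit comes from the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hit:
    """One domain hit from a similarity search (hypothetical record layout)."""
    domain: str   # "RT", "MT" or "YR"
    start: int    # start on forward-strand coordinates
    end: int      # end on forward-strand coordinates
    strand: str   # "+" or "-"


MAX_SPAN = 10_000                   # 10-kb fragment, as stated in the text
DOMAIN_ORDER = ("RT", "MT", "YR")   # assumed 5'->3' order along the element


def candidate_elements(hits, max_span=MAX_SPAN, order=DOMAIN_ORDER):
    """Return hit triplets that are co-oriented, well-ordered and within max_span."""
    found = []
    for strand in "+-":
        by_domain = {
            d: [h for h in hits if h.domain == d and h.strand == strand]
            for d in order
        }
        for triplet in product(*(by_domain[d] for d in order)):
            # On the minus strand the element reads right to left
            # on forward-strand coordinates, so the order is reversed.
            ordered = triplet if strand == "+" else triplet[::-1]
            well_ordered = all(a.start < b.start for a, b in zip(ordered, ordered[1:]))
            span = max(h.end for h in triplet) - min(h.start for h in triplet)
            if well_ordered and span <= max_span:
                found.append(triplet)
    return found


# Hypothetical hits: one complete element and two isolated domains.
hits = [
    Hit("RT", 1000, 1370, "+"),
    Hit("MT", 2500, 2810, "+"),
    Hit("YR", 3200, 4120, "+"),
    Hit("YR", 50000, 50900, "+"),   # too far away from the RT and MT hits
    Hit("MT", 7000, 7300, "-"),     # wrong orientation
]
print(len(candidate_elements(hits)))
```

Because all three domains must be present, a filter of this kind ignores relics and fragments by construction, which is consistent with the conservative copy numbers discussed in the text.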

Using ReDoSt, we identified 4310 copies of putative
DIRS1-like elements distributed among 32 diverse species
out of the 274 well-sequenced genomes tested (Table 2). A
wide spectrum of eukaryote species is
Table 2 Results of DIRS1-like retrotransposon detection and clustering

Higher taxon         Species                            Copy number  Family number  Min  Max   Reference
Actinopterygii       Danio rerio *                      2091         14             1    1157  a
                     Gasterosteus aculeatus             21           4              1    12    b
                     Oryzias latipes                    6            1              -    -     c
                     Takifugu rubripes *                7            1              -    -     d
                     Tetraodon nigroviridis *           8            2              1    7     b
Amoebozoa            Dictyostelium discoideum *         16           1              -    -     [51]
                     Acanthamoeba sp.                   1            1              -    -     e
Amphibia             Xenopus tropicalis *               692          81             1    38    [52]
Annelida             Capitella sp. I                    5            2              1    4     f
Blastocladiomycota   Allomyces macrogynus               21           6              1    10    b
Cephalochordata      Branchiostoma floridae             15           11             1    3     [53]
Chlorophyta          Chlamydomonas reinhardtii          11           5 (3)          1    4     [54]
                     Volvox carteri                     36           6 (4)          2    13    [55]
Cnidaria             Nematostella vectensis *           60           21 (1)         1    7     [48]
Crustacea            Daphnia pulex *                    100          39             1    5     [56]
Echinodermata        Strongylocentrotus purpuratus *    4            4              -    -     e
Haptophytes          Emiliania huxleyi                  1            1              -    -     f
Hemichordata         Saccoglossus kowalevskii *         240          8 (1)          1    175   e
Heterolobosea        Naegleria gruberi                  7            6              1    2     [57]
Hexapoda             Bombyx mori                        6            2              3    3     [58]
                     Nasonia vitripennis *              37           18             1    4     e
                     Tribolium castaneum *              1            1              -    -     e
Mucoromycotina       Mucor circinelloides               3            2              1    2     f
                     Phycomyces blakesleeanus *         28           13             1    5     f
                     Rhizopus oryzae *                  24           11             1    4
Mollusca             Aplysia californica                39           7              2    10    b
                     Lottia gigantea                    44           22 (1)         1    5     f
Nematoda             Caenorhabditis briggsae $          1            1 (1)          -    -     g
                     Pristionchus pacificus $           4            3 (3)          1    2     g
Petromyzontida       Petromyzon marinus                 2            2              -    -     g
Sauropsida           Anolis carolinensis                775          42             1    319   b
Urochordata          Oikopleura dioica *                4            2              1    3     h

For each species, the number of sequences detected using ReDoSt and the number of families obtained with the MCL program are given. When they are
informative, the minimum (Min) and the maximum (Max) numbers of sequences included in a family are provided. Species in which the presence of DIRS1-like
elements was previously reported (cf. Table 1) are indicated with an asterisk. In the family number column, numbers in brackets indicate the number of families
that we characterized as PAT-like elements. The two Nematoda species that comprise only PAT-like elements are indicated with a dollar. The clustering was
performed on all sequences detected in the 32 species. The families shared by several species are represented several times in the table.
Notes: a: The zebrafish genome sequencing project at the Sanger Institute (http://www.sanger.ac.uk/Projects/D_rerio/) funded by the Wellcome Trust. b: The
Broad Institute of Harvard and MIT (http://www.broadinstitute.org/). c: The National Institute of Genetics and the University of Tokyo
(http://medakagb.lab.nig.ac.jp/Oryzias_latipes/index.html). d: The Institute of Molecular and Cell Biology (http://www.fugu-sg.org/). e: The Baylor College of
Medicine Human Genome Sequencing Center (http://www.hgsc.bcm.tmc.edu/project-species-x-organisms.hgsc). f: The U.S. Department of Energy Joint Genome
Institute (http://www.jgi.doe.gov/genome-projects/). g: The Genome Institute at Washington University School of Medicine in St. Louis
(ftp://genome.wustl.edu/pub/organism/). h: The Genoscope (http://www.genoscope.cns.fr/externe/GenomeBrowser/Oikopleura/).

represented in which some taxa are characterized for the
first time as harboring DIRS1-like retrotransposons. For
example, we observed the first DIRS1-like elements in
Mollusca (Aplysia californica and Lottia gigantea).
Interestingly, DIRS1-like retrotransposons can be detected
in all the species tested in two higher taxa,
Actinopterygii and Mucoromycotina. ReDoSt was able to
detect DIRS1-like elements in all species already described
in the literature except those harbored in the honey bee
Apis mellifera genome. This discrepancy is due to the fact
that this genome contains only remnant fragments of
DIRS1-like elements that ReDoSt is unable to detect [21].

As expected, the identified elements seem to be
well-conserved. The length of the three detected domains



Piednoel et al. BMC Genomics 2011, 12:621
http://www.biomedcentral.com/1471-2164/12/621



appears highly constrained within the elements of a given
genome. For example, in the Sauropsida Anolis carolinensis
genome, almost all RT-, MT- and YR-encoding fragments have
a length ranging from 360 to 380 bp, 300 to 320 bp, and 900
to 940 bp, respectively (Additional File 1). This pattern
is present in most genomes, with the notable exception of
Saccoglossus kowalevskii, whose elements vary considerably
in domain length (Additional File 1), possibly because of
multiple large fragment deletions.

Considering the distribution of the 4310 copies detected in
32 eukaryotes, the copy number per genome appears highly
variable (Table 2), even within some of the higher taxa
examined. In Actinopterygii, the low copy numbers detected
in Oryzias latipes, Takifugu rubripes, Tetraodon
nigroviridis and Gasterosteus aculeatus (6, 7, 8, and 21
copies, respectively) contrast with the 2091 copies
identified in D. rerio. Conversely, in Mucoromycotina,
Mucor circinelloides has about ten times fewer copies than
the other related species. The copy number per genome is
usually relatively low, as illustrated by the fact that
about half of the species harbor fewer than 8 copies.
Twelve species show between 10 and 60 copies and only 5
species harbor 100 copies or more (D. pulex, S.
kowalevskii, X. tropicalis, A. carolinensis and D. rerio).
This suggests that recent element activity is relatively
low, resulting either from the inactivation of most genomic
copies or from a strong regulation of the copy number. The
loss of elements in some higher taxa or species could be
facilitated by this low copy number. However, the
relatively low copy number observed in genomes is
necessarily a conservative estimate, since only
well-conserved copies are considered based on the three
coding domains studied. For example, similarity searches on
Acanthamoeba sp. allowed us to reveal 29 more degenerate
sequences related to the unique element detected using
ReDoSt (data not shown).
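
The counts quoted in this paragraph can be rechecked directly from the copy numbers in Table 2. The short script below is only a bookkeeping sketch: the dictionary transcribes the copy-number column as read from Table 2, and the variable names are chosen for this example.

```python
# Copy numbers per species, transcribed from Table 2.
COPIES = {
    "Danio rerio": 2091, "Gasterosteus aculeatus": 21, "Oryzias latipes": 6,
    "Takifugu rubripes": 7, "Tetraodon nigroviridis": 8,
    "Dictyostelium discoideum": 16, "Acanthamoeba sp.": 1,
    "Xenopus tropicalis": 692, "Capitella sp. I": 5,
    "Allomyces macrogynus": 21, "Branchiostoma floridae": 15,
    "Chlamydomonas reinhardtii": 11, "Volvox carteri": 36,
    "Nematostella vectensis": 60, "Daphnia pulex": 100,
    "Strongylocentrotus purpuratus": 4, "Emiliania huxleyi": 1,
    "Saccoglossus kowalevskii": 240, "Naegleria gruberi": 7,
    "Bombyx mori": 6, "Nasonia vitripennis": 37, "Tribolium castaneum": 1,
    "Mucor circinelloides": 3, "Phycomyces blakesleeanus": 28,
    "Rhizopus oryzae": 24, "Aplysia californica": 39, "Lottia gigantea": 44,
    "Caenorhabditis briggsae": 1, "Pristionchus pacificus": 4,
    "Petromyzon marinus": 2, "Anolis carolinensis": 775,
    "Oikopleura dioica": 4,
}

total = sum(COPIES.values())
fewer_than_8 = sum(1 for n in COPIES.values() if n < 8)
between_10_and_60 = sum(1 for n in COPIES.values() if 10 <= n <= 60)
at_least_100 = sorted(name for name, n in COPIES.items() if n >= 100)

print(f"{len(COPIES)} species, {total} copies in total")
print(f"{fewer_than_8} species with fewer than 8 copies")
print(f"{between_10_and_60} species with 10 to 60 copies")
print(f"{len(at_least_100)} species with at least 100 copies: {', '.join(at_least_100)}")
```

With these values the total is 4310 copies in 32 species, which matches the figures given in the text.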

To our knowledge, the copy number has previously been
estimated in only two genomes: the slime mold Dictyostelium
discoideum and the crustacean Daphnia pulex. In D.
discoideum, the previous copy number estimation of
DIRS1-like retrotransposons suggested 40 full-size elements
and around 200 incomplete copies [22]. Our detection tool
identified 16 well-conserved copies. This result seems
consistent with the previous estimation considering the
difference in the methods used. The previous analysis
estimated the copy number with quantitative Southern-blot
experiments using the complete DIRS1-like sequence as a
probe. For this reason it may detect more altered elements
than our tool does. This is especially the case with the
nested elements [23], which amplify the signal in Southern
blots but are by default considered to be a unique copy by
the in silico ReDoSt analysis (see Methods). In D. pulex,
the DIRS1-like copy number has been previously estimated at
218 [24], including only 19 intact copies (i.e.,
uncorrupted sequences and conserved ITRs) [25]. This
estimation also seems consistent with our results (100
copies detected), as ReDoSt identifies well-conserved
elements but is not limited to intact copies.

The diversity of DIRS1-like retrotransposons

To study the diversity of the DIRS1-like elements, we used
the MCL program to cluster into families all the sequences
that were detected with ReDoSt as well as reference
elements. The parameter values used for clustering with the
MCL program were empirically estimated to discriminate each
of the DIRS1-like families previously described (e.g.,
DrDIRS1, DrDIRS2 and DrDIRS3 in D. rerio). Based on the
sequence identity, the clusters obtained on the reverse
transcriptase-encoding sequences using the MCL program are
considered to correspond to different DIRS1-like families.
For example, the sequence identities within the largest
cluster in A. carolinensis (319 sequences) range from 57%
to 100%, with an average sequence identity of 81%. Such a
relatively high nucleotide sequence divergence is similar
to that observed in reverse transcriptases encoded by
non-LTR retrotransposons and in some DNA transposases. The
cluster number obtained in each genome reflects the
diversity of DIRS1-like elements.
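
The range and mean of sequence identity within a cluster, as quoted above for A. carolinensis, can be obtained from pairwise comparisons of aligned sequences. The sketch below shows one possible way to compute them. It assumes that the sequences are already aligned to the same length and that gapped columns are ignored, which is a choice made for this example and not necessarily the convention used in the study. The three sequences are invented for illustration.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def identity_summary(aligned):
    """Return (min, mean, max) pairwise identity for a dict of aligned sequences."""
    values = [
        percent_identity(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    ]
    return min(values), sum(values) / len(values), max(values)


# Hypothetical aligned fragments, for illustration only.
cluster = {
    "copy1": "ATGGCTAAGC",
    "copy2": "ATGGCTAAGC",
    "copy3": "ATGACT-AGT",
}
lowest, mean, highest = identity_summary(cluster)
print(f"identity: min {lowest:.1f}%, mean {mean:.1f}%, max {highest:.1f}%")
```

For a cluster of 319 sequences this amounts to about 50,000 pairwise comparisons, which remains cheap for fragments of a few hundred base pairs.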

A total of 287 families were found distributed unevenly
among the genomes of the 32 species examined (Table 2).
Most of the families seem restricted to only one species,
with the notable exception of Mucoromycotina species, for
which several interspecific families are obtained. Some
species show very low element diversity in comparison to
their copy number. For example, all 16 copies detected in
D. discoideum grouped into a single family. On the other
hand, a few species show very high element diversity. For
example, S. purpuratus harbors 4 copies distributed among 4
families. Likewise, the 15 copies of B. floridae are split
into 11 families. The distribution of copy number per
family shows two major profiles according to species
(Figure 2 and Additional File 2). Comparing the two
vertebrate species X. tropicalis and A. carolinensis, both
of which harbor high copy and family numbers, the Western
clawed frog contains families almost equal in size whereas
the lizard contains two families that together include 64%
of the copies. The two fungi Rhizopus oryzae and Allomyces
macrogynus have only about 20 copies each, which are well
distributed among families in R. oryzae, while half of the
copies of A. macrogynus belong to one family. Finally, in
D. rerio, which harbors the highest copy number, 96% of the
2091 copies belong to just three families (1157 and 767
copies for DrDIRS1 and
Figure 2 Distribution of family size in five representative species. Families are arranged along a gradient of decreasing size. For each
species, mean family size and standard deviation are given. X-axis: family rank, Y-axis: number of elements in the family.



DrDIRS2, respectively). Such a distribution, with a high
copy number restricted to few families, could be related to
bursts of transposition. Bursts of DIRS1-like element
activity are also suspected in S. kowalevskii (the
SkoDIRS1 family alone accounts for 175 of the 240 copies
identified) and in A. carolinensis (the AcDIRS1 and AcDIRS2
families together harbor more than 60% of the different
copies).
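
The concentration of copies in a few families, used above as an indication of possible bursts of transposition, can be expressed as the share of copies held by the k largest families. The helper below is a minimal sketch with a hypothetical function name; it uses only the family sizes that are quoted in the text, since the complete list of family sizes is not given here.

```python
def top_family_share(family_sizes, total_copies, k):
    """Fraction of all copies contained in the k largest families."""
    largest = sorted(family_sizes, reverse=True)[:k]
    return sum(largest) / total_copies


# Only the family sizes quoted in the text are used here.
drerio = top_family_share([1157, 767], 2091, 2)   # DrDIRS1 and DrDIRS2
skowal = top_family_share([175], 240, 1)          # SkoDIRS1

print(f"D. rerio, two largest families: {100 * drerio:.1f}% of copies")      # 92.0%
print(f"S. kowalevskii, largest family: {100 * skowal:.1f}% of copies")      # 72.9%
```

A high share for a small k only indicates that copies are concentrated in few families; on its own it does not date the amplification.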





Phylogenetic analysis of DIRS1-like retrotransposons

To infer the relationships among the various members of the
DIRS1-like superfamily, we constructed a phylogenetic tree
(Figure 3) based on an alignment of amino acid pol region
sequences (214 sites). This phylogenetic tree contains 114
sequences, including a representative sequence of each
family that has at least one uncorrupted copy, 23
DIRS1-like or PAT-like reference elements and 4 Ty3/Gypsy
elements used as outgroups. Preliminary analysis of the
three genomes that present high family numbers (42 families
in A. carolinensis, 39 in D. pulex, and 81 in X.
tropicalis) has shown that all of the elements from a given
species cluster together into a monophyletic group (data
not shown). For these species, only representative elements
from the 4 or 5 largest families were included in the
phylogenetic analysis. In contrast to previous analyses on
much smaller datasets, the monophyly of DIRS1-like elements
is not supported in the present study (bootstrap support
lower than 75%). Such a pattern could be an artifact of a
dataset that is too large and includes divergent elements.
Alternatively, it might suggest that the PAT elements
belong to the DIRS1-like superfamily, representing a
peculiar group because of their structure. Many
well-supported groups can be identified within the
DIRS1-like elements. In many cases, the elements from a
given species form a monophyletic group (e.g., elements
from Nasonia vitripennis, D. pulex or A. carolinensis).
However, some species harbor elements from two or three
different groups (e.g., two and three element groups in A.
californica and L. gigantea, respectively). In the same
way, each group usually integrates elements from the same
species or from a few closely related ones. For example,
all the elements identified in fishes belong to one group
called DrDIRS1 [21]. Likewise, the fungi group 1 comprises
most of the elements identified in fungi, a result that
confirms the close relationships between most fungal
DIRS1-like elements revealed by the MCL analysis. Although
the relationships among the different DIRS1-like groups are
difficult to resolve, the monophyletic groups comprise only
elements from a single species or from related species, and
the tree topology therefore shows no clear evidence of
horizontal transfer.

Discriminating the PAT-like sequences included in the 
final dataset 

The PAT-like retrotransposons are the sister group of
DIRS1-like elements and show a similar structure with the
exception of their termini [6]. To discriminate the
putative PAT-like elements retained by ReDoSt, 5 PAT-like
reference sequences were included during the clustering
process and the phylogenetic analysis (Figure 3). This
allowed us to determine that 11 families correspond to
PAT-like retrotransposons (Table 2). This includes 6
families from the chlorophytes (Chlamydomonas reinhardtii
and Volvox carteri), 3 families from the nematodes
(Caenorhabditis briggsae and Pristionchus pacificus), one
family from L. gigantea, and one shared by Nematostella
vectensis and S. kowalevskii.

The presence of DIRS1-like retrotransposons is confirmed in
25 species, but still remains uncertain in Emiliania
huxleyi, Petromyzon marinus, Naegleria gruberi, P.
pacificus, V. carteri and C. reinhardtii. Elements from
these species do not cluster with any reference elements
and their sequences harbor too many frameshifts or indels
to be included in our phylogenetic analysis. For these
elements, we checked the presence of DIRS1-like elements by
similarity searches using the TBLASTX program [26] against
the Repbase database, which we had previously re-annotated
for the DIRS1-like and PAT elements (data not shown). A
family was assigned to the DIRS1-like element superfamily
under two conditions: (i) an E-value lower than 1e-20 with
at least one DIRS1-like reference element; and (ii) a
minimum difference of 1e-10 between the best E-values
obtained with DIRS1-like and PAT reference elements. Under
these criteria, the presence of DIRS1-like retrotransposons
could be confirmed in V. carteri, P. marinus and N.
gruberi, but remains uncertain in C. reinhardtii and E.
huxleyi, whereas the element detected in P. pacificus
appears to be a PAT-like retrotransposon. Thus, 30 of the
32 species revealed by ReDoSt are now considered as
harboring DIRS1-like retrotransposons, and the two
remaining species in fact possess only PAT elements.
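
The two assignment conditions stated above can be written as a small decision function. The sketch below rests on assumptions that should be kept in mind. Condition (ii) is read here as a gap of at least ten orders of magnitude between the two best E-values, which is one possible interpretation of a "difference of 1e-10". The symmetric rule used to call a family PAT-like and the floor applied to E-values reported as 0 are additions made for this example and are not stated in the text.

```python
import math

MAX_DIRS_EVALUE = 1e-20   # condition (i) in the text
MIN_LOG10_GAP = 10.0      # condition (ii), read as a 1e-10 ratio (assumption)
FLOOR = 1e-300            # hypothetical floor so that E-values of 0 can be logged


def _log10(evalue: float) -> float:
    """log10 of an E-value, with a floor to handle values reported as 0."""
    return math.log10(max(evalue, FLOOR))


def classify_family(best_dirs_evalue: float, best_pat_evalue: float) -> str:
    """Return 'DIRS1-like', 'PAT-like' or 'uncertain' from the two best E-values."""
    gap = _log10(best_pat_evalue) - _log10(best_dirs_evalue)
    if best_dirs_evalue < MAX_DIRS_EVALUE and gap >= MIN_LOG10_GAP:
        return "DIRS1-like"
    # Symmetric rule for PAT-like elements: an assumption of this sketch.
    if best_pat_evalue < MAX_DIRS_EVALUE and -gap >= MIN_LOG10_GAP:
        return "PAT-like"
    return "uncertain"


# Hypothetical E-values, for illustration only.
print(classify_family(1e-45, 1e-12))   # strong DIRS1-like hit, weak PAT hit
print(classify_family(1e-25, 1e-22))   # both hits similar
print(classify_family(1e-8, 1e-40))    # strong PAT hit only
```

Under this reading, a family whose best hits against both reference sets are of similar strength stays unassigned, which corresponds to the uncertain cases reported in the text.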

Distribution of DIRS1-like elements among eukaryotes

DIRS1-like retrotransposons are now described in 61 diverse
eukaryote species (Figure 4), including 14 species in 8
higher taxa newly characterized using ReDoSt: annelids,
blastocladiomycetes, cephalochordates, chlorophytes,
heteroloboseans, molluscs, petromyzontids and sauropsids.
The DIRS1-like element distribution does not seem to be as
patchy as previously described. Sixteen of the 28 unikont
groups tested revealed the presence of these elements,
indicating a wide distribution. This distribution could
prove to be wider in the near future, since seven of the
unikont groups apparently devoid of DIRS1-like elements are
currently represented by only one or two completely
sequenced genomes. Conversely, four other unikont groups
seem to be clearly devoid of DIRS1-like elements. Despite a
high number of completely sequenced genomes and diverse
taxa tested, no well-conserved copies could be identified
in any ascomycetes (75 species), basidiomycetes (16
species), nematodes (12 species) or mammals (37 species). A
specific loss of DIRS1-like elements in Mammalia during
evolution is the most probable cause of their absence when
one takes into consideration their
Figure 3 Rooted phylogenetic tree based on the pol amino acid sequences of the DIRS1-like families identified. Distances are calculated
with parameter model plus gamma distribution's correction for amino acids. The tree is constructed using the Neighbor Joining method
and pairwise deletion of gaps option included in MEGA5.0 software. When possible, one representative copy sequence that required only minor
corrections for each family was integrated into our analysis. Reference elements are labeled with an asterisk and clusters that correspond to an
element annotated in this study are written in bold italics. If a reference element was included in a family, this sequence was chosen to
represent the family. In the cases of Anolis carolinensis, Daphnia pulex and Xenopus tropicalis, species that show a high family number, only four
or five of their most abundant families were integrated. Ty3/Gypsy element sequences were used as outgroups according to the close
relationships of their reverse transcriptase and RNase H domains with those of DIRS1-like and PAT retrotransposons. Support for individual
groups was evaluated with non-parametric bootstrapping using 100 replicates. Only bootstrap node values over 50% are represented.

Figure 4 Distribution of DIRS1-like elements among the eukaryote groups tested. Species phylogeny was redrawn from [42-45].
Chromalveolate species are shown with red branches, excavates with purple branches, plants with green branches, and unikonts with blue
branches. Groups in which DIRS1-like elements were detected and confirmed are shaded in yellow. Groups in which the presence of DIRS1-like
elements remains uncertain are shaded in green. In each group, we include in parentheses the number of species harboring DIRS1-like
sequences (in blue) compared to the number of species analyzed (in red), as well as the number of species not screened in this study in which
DIRS1-like retrotransposons were previously reported (in purple). B: Bilateria, D: Deuterostomia, F: Fungi.



wide distribution in Unikonta, especially Deuterostomia.
Outside of unikonts, DIRS1-like retrotransposons appear
infrequently, observed in only three groups, even though
most groups are represented by relatively few species.

Various distributional patterns can currently be observed
among eukaryotes. On a large phylogenetic scale, we make two
observations: (i) a wide distribution of DIRS1-like elements
among groups such as deuterostomes, with the detection of
copies in a wide range of higher taxa from Echinodermata to
Sauropsida; and (ii) a broad distribution of DIRS1-like
elements in certain taxa despite a lack of detection in
closely related taxa. In fungi, all three Mucoromycotina
genomes were found to harbor DIRS1-like elements, whereas
none could be detected in Ascomycota and Basidiomycota. On a
smaller phylogenetic scale (i.e., within a higher taxon), the
distribution again appears to be taxon-dependent, with three
distinguishable patterns. First, as described above, some
groups seem to possess no DIRS1-like retrotransposons (e.g.,
mammals and streptophytes). Second, a broad distribution of
DIRS1-like elements was observed in some groups, such as
Actinopterygii and Mucoromycotina (detection in all 5 and 3
genomes tested, respectively). Finally, a sparser
distribution of DIRS1-like elements was observed in yet other
groups: only 3 of the 22 hexapod species tested harbor
well-conserved elements. However, this heterogeneous
distribution could result in part from a sampling bias. We
observed a lack of elements in some overrepresented taxa,
such as Diptera (absence of detection in 16 Drosophila
species tested), and an abundance in others, such as
Hymenoptera (in three wasp and five ant species). Indeed, we
used ReDoSt to analyze the recently released ant genomes, all
of which harbor DIRS1-like elements. Five copies were found
in Camponotus floridanus, 22 in Pogonomyrmex barbatus, 37 in
Harpegnathos saltator, 41 in Linepithema humile, and 57 in
Solenopsis invicta [27-31].

The earlier view that DIRS1-like retrotransposons are
uncommon among eukaryotes appears to be strongly biased,
considering that ascomycetes, mammals and green plants, which
are devoid of elements, represent more than 55% of the
sequenced genomes. DIRS1-like elements do not appear as
ubiquitous as Ty1/Copia and Ty3/Gypsy retrotransposons, but
their distribution among eukaryotes appears more comparable
to the Penelope element distribution [12,13]. Despite their
loss in several lineages, the phylogenetic analysis and the
distribution of DIRS1-like elements in a very broad range of
unikonts indicate that their genomic invasion occurred early
in unikont evolution: at least prior to the Bilateria
radiation, but probably earlier if we take into account the
presence of DIRS1-like retrotransposons in Amoebozoa and
Fungi (Figure 4). This primary invasion could be found to
have occurred even earlier in evolution if the presence of
DIRS1-like elements is confirmed in Excavata, Plantae and
Chromalveolata. Though our results unequivocally indicate
the presence of DIRS1-like elements in Unikonta, we must be
cautious in our estimation of their real distribution in
Excavata and Plantae, because most of the copies identified
in these taxa harbor too many indels and frameshifts in the
repeated sequence structures to be studied and to be included
in the phylogenetic analysis. The presence of DIRS1-like
elements in these species is only supported by similarity
search analyses.

The absence of DIRS1-like elements in several groups may
reflect their differential success in adapting to different
host species and/or a propensity for stochastic loss during
evolution. Nevertheless, this absence has to be confirmed in
the future by searching for deleted DIRS1-like copies in
these genomes. The detection of deleted copies in an
apparently "unoccupied" species would be evidence of the
previous existence of well-conserved DIRS1-like elements.



In-depth characterization of new DIRS1-like elements

To describe the diversity within the DIRS1-like superfamily,
we detailed the structure of 28 new elements, most of which
represent high copy number families or species newly
characterized for the presence of such retrotransposons
(e.g., A. californica and L. gigantea). Several features of
DIRS1-like retrotransposons are presented in Table 3, such as
their length, the presence of a long ORF overlap, and the
structure of their repeats. The length of DIRS1-like
retrotransposons varies among the 28 elements from 3974 bp in
AcasDIRS1 (Acanthamoeba sp.) to 6283 bp in SkoDIRS2 (S.
kowalevskii), with an average length of 5160 bp. In-depth
annotation including the positions of the repeated sequences
and several conserved motifs is provided in Additional File
3. The pol motifs seem to be highly conserved, especially the
'YL/IDD' motif that is conserved in 25 of the 28 annotated
elements. The 'HSTR' tyrosine recombinase motif appears more
variable (only harbored by 13 of the 28 elements). For
example, AmDIRS2 and MciDIRS1 harbor an 'SDLIC' and an 'LCPV'
sequence, respectively. This suggests that the catalytic
tyrosine recombinase-encoding domain sequence could be less
constrained than the pol sequence. Twenty-three of the
elements begin and end with a trinucleotide NTT, the most
frequent being ATT (Table 3). Only AmDIRS1 from A.
macrogynus begins and ends with an uncommon GC-rich motif. In
almost all elements, this trinucleotide appears
complementary to the 3-bp circular junction. Evidence of long
ORF overlaps was found in half of the 28 DIRS1-like elements,
which seems to depend on the host species (e.g., evidence in
the five elements from Fungi and none in Mollusca).

Previous studies have outlined the structure of DIRS1-like
retrotransposons, especially the nature of their termini,
which complement the Internal Complementary Regions (ICRs),
and the presence of a right Extension sequence (rE) [3].
Looking in detail at the repeated sequences
"lITR-lICR-rICR-rITR-rE" in these elements allowed us to
reveal a rather more complex structure (Figure 5). Whereas
previous studies only described an rE sequence, we have
characterized an equivalent left Extension sequence (lE) at
the 5'-end of some elements, which is only complementary to
the left ICR. The identification of this additional lE
sequence does not challenge the replication model that
proposes a rolling-circle intermediate. This intermediate is
produced by the 3-bp circular junction that corresponds to
the overlap of the two ICRs complementary to the 5'- and
3'-ends of the element [3-5]. All elements harbor at least
one extension, and, like DIRS1, most elements contain only an
rE. The lE region has only been detected in fungi and
amoebozoa species. Two elements show only an lE (AcasDIRS1,
AmDIRS1) and four other
Table 3 Annotation of 28 DIRS1-like retrotransposons

Columns: Element; Host; Size; start; end; circular junction; long ORF overlap; Termini size (lE, divergent ITR, conserved ITR, rE); ICR size

AcDIRS1; A. carolinensis; Sauropsida
The element size and trinucleotide sequences beginning and ending the element complementary to the circular junction are given for each manually annotated
DIRS1-like element. Evidence of long ORF overlap is also indicated. The lengths of the different parts of the termini (the divergent and the conserved ITRs, lE and
rE) as well as those of the ICRs are reported. nd: not determined because CaspDIRS1 corresponds to a chimeric sequence. Each newly identified element has
been submitted to Repbase.



elements harbor the two extensions (e.g., AmDIRS2,
MciDIRS1). We hereby propose to redefine the fine structure
of the termini of DIRS1-like elements (Figures 5 and 6). In
this study, we call the left and right termini (lTer and
rTer) the assembly of two components: the ITRs and their
respective potential extension (lE or rE). The lE and rE
regions are considered the external sequences of the termini
that are only complementary to their respective ICR
sequences (theoretically 100% sequence identity). The ITRs
are defined as the parts of these terminal sequences that are
mostly complementary to each other. On a smaller scale, two
parts can be distinguished within these ITRs (Figure 6). In
the conserved ITR part, the two ITRs are strictly
complementary to each other. In the divergent ITR part, the
two ITR sequences are mostly constrained by their respective
ICR and remain only partially complementary to each other,
with a sequence identity that varies from 50% to 85%. ITR
length appears highly variable among the different elements,
ranging from 66 bp (LgDIRS1) to 316 bp (DIRS1-2). Likewise,
the length of the ICRs varies between 85 bp for the sum of
the two ICRs in AcasDIRS1 and 130 bp in AcaDIRS1. The right
extensions vary from 9 bp to 75 bp (being apparently shorter
in the presence of an lE). In most cases, the sizes of the
various repeats are conserved among the different elements
from the same species (e.g., among AcDIRS1 and AcDIRS2, or
AcaDIRS1 and AcaDIRS2). The conserved
Figure 5 Structure of repeated sequences observed within DIRS1-like retrotransposons. The left (lTer) and right (rTer) termini as well as
the left (lICR) and right (rICR) ICRs are represented by triangles. ITRs are represented by bubbles. Blue items correspond to the complementary
lTer and lICR sequences. Green items correspond to the complementary rTer and rICR sequences. Colored parts of the lTer and rTer that
do not overlap red bubbles symbolize the lE and rE sequences. A: Element harboring only one extension (here the rE sequence). In this case, the
ICR is only complementary to the left ITR (lITR); B: Elements harboring both the lE and rE sequences.



ITR usually represents the largest part of the ITR, ranging
from 31 bp to 304 bp (Table 3), whereas the divergent ITR is
often small, ranging from 9 bp to 36 bp. However, in some
elements from molluscs both parts have about the same size.
Interestingly, the boundary between these two ITR parts is
composed of a short sequence of at least 10 nucleotides that
is conserved in the two ITRs and the two ICRs (Figure 6),
which may be involved in the formation of the circular
intermediate of the element before its integration.
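
The degree of complementarity between the two ITRs discussed above (strict in the conserved part, 50% to 85% in the divergent part) can be illustrated with a small calculation. The following is a minimal sketch in Python, not the procedure used in this study: the function names are hypothetical, the two sequences are assumed to be already trimmed to the same length, and gaps are not handled.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def percent_identity(a, b):
    """Percent identity between two equal-length, ungapped sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def itr_complementarity(left_itr, right_itr):
    """Identity between the left ITR and the reverse complement of the
    right ITR; 100.0 means the two repeats are strictly complementary."""
    return percent_identity(left_itr, reverse_complement(right_itr))
```

Applied to two hypothetical 6-bp repeats, `itr_complementarity("ATTGCA", "TGCAAT")` gives 100.0, whereas a single mismatch, as in `itr_complementarity("ATTGCA", "GGCAAT")`, gives about 83.3.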

Conclusions 

In this study, we developed a new computational tool,
ReDoSt, allowing us to describe more precisely the
distribution of DIRS1-like retrotransposons as well as their
diversity among eukaryote genomes. These elements appear
more continuously distributed than previously thought, with 8
new higher taxa characterized to harbor these elements (e.g.,
Mollusca) and 14 new eukaryote species, giving a total of 61
species containing DIRS1-like elements in their genome. The
current understanding of the distribution of DIRS1-like
elements in Eukaryota, and especially Unikonta, suggests the
presence of DIRS1-like elements in the last common ancestor
of eukaryotes. Whereas some higher taxa seem clearly devoid
of well-conserved DIRS1-like retrotransposons (e.g.,
ascomycetes, mammals and streptophytes), these elements
appear highly conserved in some other
Figure 6 Alignment of the nucleotide sequences of RoDIRS1 element termini and ICRs. In this alignment the left terminus (lTer) and the
right ICR (rICR) sequences are represented with the inverse complementary sequences of the right terminus (rTer) and the left ICR (lICR)
sequences. The left and right extensions (lE and rE) of the terminal regions are indicated by orange and brown boxes. The portion called
"divergent ITRs" represents the part of the ITRs that is mostly complementary to the corresponding ICRs. The region that is conserved among
the four different sequences is underlined in purple.



higher taxa, such as Actinopterygii and Mucoromycotina. Now
that a large diversity of elements within the DIRS1-like
superfamily (around 300 different families) has been
characterized, it is possible to screen sequence datasets for
the presence of DIRS1-like elements using more conventional
approaches like RepeatMasker. This large diversity allowed us
to study the phylogenetic relationships within the
DIRS1-like superfamily, in which the different groups appear
related to the host species. All of the elements included in
the phylogenetic analysis as well as the subset of 28
annotated elements were used to define two new alignment
profiles for each of the three characteristic domains of the
DIRS1-like retrotransposons: reverse transcriptase,
methyltransferase and tyrosine recombinase. These profiles
could be used in further studies or in future automatic
annotation of transposable elements within genomes
(Additional file 4).

Methods 

Data collection 

The 274 complete or draft genomic sequences were downloaded
from eight different databases: the DOE Joint Genome
Institute (http://www.jgi.doe.gov/), the Broad Institute of
MIT and Harvard (http://www.broadinstitute.org/), the Human
Genome Sequencing Center at the Baylor College of Medicine
(http://www.hgsc.bcm.tmc.edu/), the Genome Center at
Washington University (http://genome.wustl.edu/), the
National Center for Biotechnology Information
(http://www.ncbi.nlm.nih.gov/), the Wellcome Trust Sanger
Institute (http://www.sanger.ac.uk/), Genoscope
(http://www.genoscope.cns.fr/spip/) and FlyBase
(http://flybase.org/). Five additional hymenopteran genomes
were obtained from the Fourmidable database [29]. A complete
list of all the genomes analyzed in this study and their
sources is given in Additional file 5. Species were selected
only if their genome is larger than 10 Mb, with sequence
coverage sufficient to represent the entire genome, and was
labeled as complete or draft by the corresponding sequencing
center or by the GOLD database
(http://www.genomesonline.org/). Reference element sequences
that were used in the alignment profile design, MCL
clustering, and phylogenetic analysis correspond to the
DIRS1-like sequences that we could access in GenBank,
Retrobase
(http://biocadmin.otago.ac.nz/fmi/xsl/retrobase/home.xsl)
and Repbase Update database version 14.06
(http://www.girinst.org/repbase/update/index.html).

Identification of DIRS1-like retrotransposons

We propose a new computational tool for DIRS1-like
retrotransposon identification, ReDoSt (Additional file 4,
updates available at
http://wwwabi.snv.jussieu.fr/public/ReDoSt/), based on both
similarity searches of domains and their organization in the
element structure. The similarity searches were performed
using the RPS-BLAST and PSI-BLAST programs [32] with an
E-value cutoff of 0.01 and specific alignment profiles for
each domain. This method, in comparison with BLAST or
RepeatMasker approaches, may be more permissive and thus
allow for the identification of more divergent elements. For
example, using this method we identified 21 DIRS1-like copies
in the A. macrogynus genome, whereas only 16 well-conserved
elements (i.e., simultaneous detection of the RT, MT and YR
domains) were detected using RepeatProteinMask and the
RepeatPeps library (included in the RepeatMasker package). We
used three different profiles whose positions within the
element are shown in Figure 1. For the RT-encoding domain, we
used the alignment profile 'cd03714' (118 amino acids,
Conserved Domain Database, http://www.ncbi.nlm.nih.gov/).
For the remaining two encoding domains, we used two specific
alignment profiles (282 and 93 amino acids for the YR and MT
profiles, respectively) that we designed using DIRS1-like
reference element alignments (Additional file 4,
http://wwwabi.snv.jussieu.fr/public/ReDoSt/). Our automatic
detection tool is composed of six main steps (Figure 7): (1)
identification of all putative reverse
transcriptase-encoding fragments within the genome; (2)
extraction of each genomic hit with 5-kb flanking sequence on
both sides, because all DIRS1-like elements described to date
are less than 6 kb in length; (3) tyrosine
recombinase-encoding domain search and (4)
methyltransferase-encoding domain search within each genomic
fragment retained; (5) after obtaining the 10-kb contigs that
harbor the three characteristic domains (RT, YR and MT) of
DIRS1-like retrotransposons, a check of the co-orientation
and the order of these domains to discriminate other types of
YR-encoding retrotransposons (e.g., Ngaro and PAT elements);
and (6) finally, setting aside of the fragments that harbor
at least two occurrences of the same domain for copy number
estimation, sequence alignments, and supplementary
investigations required to determine from which
rearrangements (duplications or insertions) they are
derived. Such a fragment has then been considered a single
copy of DIRS1-like



Figure 7 ReDoSt pipeline developed in this study for the identification of DIRS1-like retrotransposons. To assess the efficiency of each
step of the pipeline, we detailed the number of fragments retained after each step for the genome of the fungus Rhizopus oryzae.
Input: genome in FASTA format (e.g., Rhizopus oryzae).
Step 1: Identification of Reverse Transcriptase encoding domains (383 fragments).
Step 2: Extraction of a fragment for each Reverse Transcriptase hit with 5kb flanking sequences.
Step 3: Identification of Tyrosine Recombinase encoding domains within the 10kb extracted fragments.
Step 4: Identification of Methyltransferase encoding domains within the 10kb extracted fragments.
Step 5: Filter for the co-orientation and the order of the three coding domains.
Step 6: Set aside the fragments harboring at least two occurrences of one domain.
Fragments retained after the subsequent steps: 39, 27, 25 and 23.
element in copy number estimation. We repeatedly observed a
bottleneck between the first and fourth steps for all of the
genomes tested (see the R. oryzae results given in Figure
7). We chose to be less stringent in the first step by using
an alignment profile designed using a large diversity of
elements, one third of which were other types of tyrosine
recombinase-encoding retrotransposons as well as one Gypsy
element. As a consequence, many of the reverse
transcriptase-encoding fragments identified may belong to
other retrotransposon superfamilies. Analyses were performed
on an iDataPlex Linux system (CPU 2.53 GHz, 3 GB memory).
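
Steps 2 and 5 of the pipeline described above can be sketched as follows. This is a minimal illustration in Python and not the ReDoSt implementation: the `Hit` record, the function names and the 0-based coordinates are assumptions made for the example, and the domain order supplied by the caller is a placeholder for the order shown in Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One domain hit on a contig (hypothetical record, not ReDoSt's format)."""
    domain: str  # "RT", "YR" or "MT"
    start: int   # 0-based start on the contig
    end: int     # exclusive end on the contig
    strand: int  # +1 or -1


def flank_window(hit_start, hit_end, contig_len, flank=5000):
    """Step 2: the RT hit plus `flank` bp on each side, clipped to the contig."""
    return max(0, hit_start - flank), min(contig_len, hit_end + flank)


def is_candidate(hits, expected_order):
    """Step 5: exactly one hit per expected domain, all on the same strand,
    and in the expected order when read along that strand."""
    domains = [h.domain for h in hits]
    if sorted(domains) != sorted(expected_order):
        return False  # a domain is missing, or duplicated (see step 6)
    strands = {h.strand for h in hits}
    if len(strands) != 1:
        return False  # domains are not co-oriented
    # On the minus strand, reading order is by decreasing coordinate.
    reverse = strands.pop() == -1
    ordered = sorted(hits, key=lambda h: h.start, reverse=reverse)
    return [h.domain for h in ordered] == list(expected_order)
```

With a placeholder order such as `("RT", "MT", "YR")`, a fragment whose three hits lie on the same strand in that order passes the filter, whereas a fragment with one domain on the opposite strand, or with a domain missing, is rejected.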

Sequence analysis 

Families of DIRS1-like elements were identified by
clustering all the nucleotide reverse transcriptase-encoding
fragment sequences detected in the 32 species with the MCL
program (http://www.micans.org/mcl/, [33]). Reference
elements previously described and/or deposited in Repbase
version 14.06 were also added to the dataset. This method was
used in previous studies on IS transposons [34,35]. An
E-value cutoff of 0.01 was used for the initial BLASTN
search. An inflation factor of 1.2 was used to cluster
sequences. These values are effective at least in splitting
elements of different previously defined DIRS1-like families
(e.g., DrDIRS1, DrDIRS2 and DrDIRS3 in D. rerio [5]). Because
clustering results can depend on the dataset used, we tested
two different approaches: an independent clustering of the
elements within each tested genome and a global clustering of
all elements from all species. Similar results were obtained
regardless of the approach used (data not shown), suggesting
that the clusters obtained are well-supported.
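
As an illustration of how BLASTN results can be prepared for MCL clustering, the sketch below converts tabular BLAST hits into the label-pair ("abc") input that MCL accepts. It is a hedged example rather than the procedure used in this study: the 12-column tabular BLAST layout, the use of -log10(E-value) as edge weight, the weight cap and the function name are all assumptions; only the E-value cutoff (0.01) and the inflation factor (1.2) come from the text.

```python
import math


def blast_to_abc(lines, evalue_cutoff=0.01, max_weight=200.0):
    """Turn tabular BLAST lines (query, subject, ..., evalue, bitscore)
    into 'query<TAB>subject<TAB>weight' edges for MCL.

    Assumed weight: -log10(E-value), capped at `max_weight`
    (an E-value of 0 receives the cap).
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > evalue_cutoff:
            continue  # drop self-hits and hits above the cutoff
        if evalue == 0.0:
            weight = max_weight
        else:
            weight = min(max_weight, -math.log10(evalue))
        edges.append(f"{query}\t{subject}\t{weight:.2f}")
    return edges

# The resulting lines, written to a file such as edges.abc, could then be
# clustered with:  mcl edges.abc --abc -I 1.2 -o clusters.txt
```

The weighting scheme is only one common choice; any monotonic transformation of the similarity score could be substituted without changing the overall approach.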

To perform the element annotation, we preferentially
selected elements from species in which DIRS1-like
retrotransposons were not previously reported or from
families showing high copy number. The repetitive structures
(ICRs and ITRs) were detected using UGENE
(http://ugene.unipro.ru/index.html). When several copies of a
family were available for one species, the boundaries of the
ITRs were analyzed manually, and the flanking regions were
detected in multiple nucleotide sequence alignments built
with MUSCLE [36]. To check for the presence of ORF overlaps,
we used the ORF Finder tool
(http://www.ncbi.nlm.nih.gov/projects/gorf/).

For phylogenetic analyses, one sequence from each family was
included that required no or only minor corrections in its
pol sequence (no large indels or multiple frameshifts). The
amino acid pol sequence multiple alignments were performed
with MUSCLE, and ambiguously aligned sites were removed using
Gblocks [37]. Phylogenetic analyses were conducted using the
neighbor-joining (NJ) method and the pairwise deletion option
of the MEGA5.0 software [38]. The best-fit model, the JTT
model [39] with gamma distribution, was selected with Topali2
software [40], and support for individual groups was
evaluated with non-parametric bootstrapping [41] using 100
replicates.

Description of additional data files 

The following additional data are available with the online
version of this paper. Additional data file 1 contains two
histograms representing the distribution of the domain sizes
for the elements detected in A. carolinensis and S.
kowalevskii. Additional data file 2 contains histograms of
the distribution of family size in several species.
Additional data file 3 provides a table listing features of
the 28 annotated DIRS1-like elements. Additional data file 4
is a mini-website providing access to the ReDoSt pipeline, to
the different alignment profiles and to the DIRS1-like
sequences used to design them. Additional data file 5 is a
list reporting the data source for all species tested.

Additional material 



Additional file 1: Domain size distributions for the elements
detected in A. carolinensis (A) and S. kowalevskii (B). The histogram
represents the number of element domains detected (y-axis) as a
function of their length (x-axis). The reverse transcriptase fragments are
represented in blue, the methyltransferase fragments in red, and the
tyrosine recombinase fragments in yellow.

Additional file 2: Distribution of family size. Families are arranged
along a gradient of decreasing size. For each species, mean family size
and standard deviation are given. X-axis: family rank, Y-axis: number of
elements in the family.

Additional file 3: Annotation of the 28 DIRS1-like elements
described. For each element, positions of the repeated sequences
within elements, the tyrosine recombinase and pol conserved motifs
(reverse transcriptase (RT), RNase H (RH) and methyltransferase domains),
and the end of the putative pol region are reported. The position of
each element within the genome sequences is also provided.

Additional file 4: ReDoSt pipeline and alignment profiles used in
this study.

Additional file 5: List of all species tested. For each species, the
acronym used during the study and the data source website are
indicated.



List of abbreviations used 

ICR: Internal Complementary Region; ITR: Inversed Terminal Repeat; lE: left
Extension; LINE: Long Interspersed Element; LTR: Long Terminal Repeat; lTer:
left Terminus; MT: MethylTransferase; rE: right Extension; RH: RNase H; RT:
Reverse Transcriptase; rTer: right Terminus; SDR: "Split" Direct Repeat; SINE:
Short Interspersed Element; YR: Tyrosine Recombinase.